INTRODUCTION

This tutorial walks through a data science pipeline using a dataset that contains the Top 200 Spotify charts in Europe for 2019, together with weather data for the different regions.

These days we are living through an exceptional situation due to the Covid-19 pandemic. Many countries are under lockdown as a measure to stop the spread of the virus. Music, as always, is used by many people to cope with the challenges posed by the situation, such as uncertainty, boredom or anxiety. People consciously choose music that complements their feelings at a specific moment, as music has the ability to evoke powerful emotional responses in listeners. This is true not only in times of crisis like the one we are living through these days; it has been like this since the beginning of time.

TUTORIAL CONTENT

  • Topic and motivation
  • Libraries
  • Data Import
  • Tidy Data
  • Interactive Data Visualization
  • Exploratory Data Analysis
  • Experiment Design and Hypothesis
  • Conclusions
  • References

1. TOPIC AND MOTIVATION

Several studies have shown that music choices reflect people's emotional preferences, and many others have shown that weather has physiological effects. In this tutorial we will study how Spotify users' music choices varied based on the time of the year and other music features.

This tutorial will cover the main data science activities applied to the chosen dataset. These activities are organized and shown in the following workflow:

2. LIBRARIES

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(broom)
library(leaflet)
library(DataExplorer)
library(knitr)

3. DATA IMPORT

After installing and importing the libraries that will be needed, the next step is to load the data. Nowadays there are many data repositories available on the internet. A great and commonly used source of datasets is Kaggle, which is where the dataset used in this tutorial was downloaded. Data can be found here.

A second dataset will also be used. It contains countries with their ISO 3166-1 Alpha-2 code, Alpha-3 code, UN M49 numeric code, and average latitude and longitude. It can be found here. We will use this dataset for interactive data visualization.

Once the datasets have been downloaded and saved in the same folder as this notebook, it is time to load these datasets into the environment.

It is really common to find datasets in comma-separated value (.csv) files. Each line in these files contains attribute values separated by commas. As this format is widely used, the tidyverse includes the readr R package, which provides the read_csv command for reading a dataset stored in a csv file.
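As a quick, self-contained sketch (using an inline string rather than the actual files, so it runs anywhere), read_csv can also be given an explicit column specification, which avoids type guessing and the "Parsed with column specification" message:

```r
library(readr)

# A small inline CSV; read_csv() accepts a literal string containing newlines.
csv_text <- "region,streams\nAUT,35174\nAUT,34237"

# Supplying col_types makes the expected schema explicit.
df <- read_csv(csv_text,
               col_types = cols(region = col_character(),
                                streams = col_double()))
df
```

The same col_types argument can be passed when reading the real files below if the default guesses are not what we want.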

We assign to variables called data and countries the results of calling read_csv.

data <- read_csv("final_spotify.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   region = col_character(),
##   date = col_date(format = ""),
##   spotify_id = col_character(),
##   artist = col_character(),
##   track_name = col_character()
## )
## See spec(...) for full column specifications.
countries <-read_csv("countries_codes_and_coordinates.csv")
## Parsed with column specification:
## cols(
##   Country = col_character(),
##   `Alpha-2 code` = col_character(),
##   `Alpha-3 code` = col_character(),
##   `Numeric code` = col_double(),
##   `Latitude (average)` = col_double(),
##   `Longitude (average)` = col_double()
## )
head(data)
## # A tibble: 6 x 26
##      X1 region date       month spotify_id artist track_name position streams
##   <dbl> <chr>  <date>     <dbl> <chr>      <chr>  <chr>         <dbl>   <dbl>
## 1     0 AUT    2019-01-02     1 7KPGeiXWD… Capit… Benzema           1   35174
## 2     1 AUT    2019-01-03     1 7KPGeiXWD… Capit… Benzema           1   34237
## 3     2 AUT    2019-01-04     1 7KPGeiXWD… Capit… Benzema           1   35199
## 4     3 AUT    2019-01-05     1 7KPGeiXWD… Capit… Benzema           1   32559
## 5     4 AUT    2019-01-06     1 7KPGeiXWD… Capit… Benzema           1   26956
## 6     5 AUT    2019-01-07     1 7KPGeiXWD… Capit… Benzema           1   35018
## # … with 17 more variables: danceability <dbl>, energy <dbl>,
## #   instrumentalness <dbl>, key <dbl>, liveness <dbl>, loudness <dbl>,
## #   speechiness <dbl>, acousticness <dbl>, tempo <dbl>, valence <dbl>,
## #   explicit <dbl>, temp <dbl>, rain <dbl>, snow <dbl>, cloud <dbl>,
## #   humidity <dbl>, const <dbl>
head(countries)
## # A tibble: 6 x 6
##   Country `Alpha-2 code` `Alpha-3 code` `Numeric code` `Latitude (aver…
##   <chr>   <chr>          <chr>                   <dbl>            <dbl>
## 1 Afghan… AF             AFG                         4             33  
## 2 Albania AL             ALB                         8             41  
## 3 Algeria DZ             DZA                        12             28  
## 4 Americ… AS             ASM                        16            -14.3
## 5 Andorra AD             AND                        20             42.5
## 6 Angola  AO             AGO                        24            -12.5
## # … with 1 more variable: `Longitude (average)` <dbl>

4. TIDY DATA

To be able to use a dataset for analysis it is necessary to prepare and organize it.

Now that the dataset has been imported, we can take a look at it and learn more about what it contains.

To learn the names of the attributes of this dataset, we can use the colnames command.

colnames(data)
##  [1] "X1"               "region"           "date"             "month"           
##  [5] "spotify_id"       "artist"           "track_name"       "position"        
##  [9] "streams"          "danceability"     "energy"           "instrumentalness"
## [13] "key"              "liveness"         "loudness"         "speechiness"     
## [17] "acousticness"     "tempo"            "valence"          "explicit"        
## [21] "temp"             "rain"             "snow"             "cloud"           
## [25] "humidity"         "const"
colnames(countries)
## [1] "Country"             "Alpha-2 code"        "Alpha-3 code"       
## [4] "Numeric code"        "Latitude (average)"  "Longitude (average)"

Also, we can learn more about the different attributes using the class (returns the type of an attribute), factor (converts an attribute to a categorical factor) or summary (summarizes an attribute, e.g. the level counts of a factor) commands. Generally the names of the attributes should be representative; however, sometimes they are not. For example, in this dataset “X1” or “const” do not say much about what they represent. Using the commands previously mentioned we will figure it out.

class(data$X1)
## [1] "numeric"
head(data$X1)
## [1] 0 1 2 3 4 5
summary(factor(data$const))
##       1 
## 1425909

After getting some information about the attributes, we now know that “X1” contains just the index of each entity in the dataset and “const” is 1 for every entity.

Now that we know more about our dataset, it is time to represent our datasets in a form that is amenable to manipulation and statistical modeling.

We want our data to be based on rectangular data structures where:

  1. Each attribute (or variable) forms a column
  2. Each entity (or observation) forms a row
  3. Each type of entity (observational unit) forms a table
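As a toy illustration of these principles (hypothetical data, not from our dataset), a table where the month is spread across column names can be reshaped with tidyr's pivot_longer so that month becomes a proper variable:

```r
library(tidyr)
library(tibble)

# Untidy: the month is encoded in the column names.
untidy <- tibble(region = c("AUT", "BEL"),
                 jan = c(100, 200),
                 feb = c(110, 190))

# Tidy: each variable (region, month, streams) is a column,
# and each observation forms a row.
tidy <- pivot_longer(untidy, cols = c(jan, feb),
                     names_to = "month", values_to = "streams")
tidy
```

Our Spotify dataset is already in this row-per-observation shape, so below we only need to drop uninformative columns rather than reshape.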

To have a perfectly tidy dataset, we will simply eliminate the attributes “const” and “X1”, as they do not provide any information about the different entities. Also, we noticed that the “month” attribute provides information that can be extracted from “date”, so it will also be eliminated. For our analysis, “spotify_id” is not necessary, as it does not provide any useful information, so we eliminate it as well.

The countries dataset will only be needed for interactive data visualization, so only the “Alpha-3 code”, “Longitude (average)” and “Latitude (average)” attributes are selected.

data<-data%>%
  select(-X1,-const, -month, -spotify_id)
head(data)
## # A tibble: 6 x 22
##   region date       artist track_name position streams danceability energy
##   <chr>  <date>     <chr>  <chr>         <dbl>   <dbl>        <dbl>  <dbl>
## 1 AUT    2019-01-02 Capit… Benzema           1   35174         0.78   0.69
## 2 AUT    2019-01-03 Capit… Benzema           1   34237         0.78   0.69
## 3 AUT    2019-01-04 Capit… Benzema           1   35199         0.78   0.69
## 4 AUT    2019-01-05 Capit… Benzema           1   32559         0.78   0.69
## 5 AUT    2019-01-06 Capit… Benzema           1   26956         0.78   0.69
## 6 AUT    2019-01-07 Capit… Benzema           1   35018         0.78   0.69
## # … with 14 more variables: instrumentalness <dbl>, key <dbl>, liveness <dbl>,
## #   loudness <dbl>, speechiness <dbl>, acousticness <dbl>, tempo <dbl>,
## #   valence <dbl>, explicit <dbl>, temp <dbl>, rain <dbl>, snow <dbl>,
## #   cloud <dbl>, humidity <dbl>
countries<-countries%>%
  select(region=`Alpha-3 code`, lng=`Longitude (average)`, lat=`Latitude (average)`)
head(countries)
## # A tibble: 6 x 3
##   region    lng   lat
##   <chr>   <dbl> <dbl>
## 1 AFG      65    33  
## 2 ALB      20    41  
## 3 DZA       3    28  
## 4 ASM    -170   -14.3
## 5 AND       1.6  42.5
## 6 AGO      18.5 -12.5

5. INTERACTIVE DATA VISUALIZATION

An interesting way of understanding the dataset is through interactive data visualization. For example, with our datasets we can show the total number of streams of each country on a map. To keep the map clean, we can use colors and markers of different radii. In this case, if a country has more than 9000 million streams, the radius is 12 and the color of the marker is red; if the country has between 1200 and 9000 million streams, the radius is 5 and the color of the marker is blue; otherwise the color is green and the radius is 2. Besides, a label with the total number of streams in millions is shown when the cursor is on top of a specific marker.

To build the interactive map we will use the leaflet library.

dataC<-left_join(data,countries, by="region")

data_reg<- dataC %>%
  select(region,streams, lng, lat) %>% 
  group_by(region, lng, lat) %>%
  summarise(Total_Streams = sum(streams)) %>%
  arrange(desc(Total_Streams),region) %>% 
  mutate(millions = Total_Streams/1000000)

getColor <- function(data_reg) {
  sapply(data_reg$millions, function(millions) {
  if(millions > 9000) {
    "red"
  } else if(millions > 1200 & millions<=9000) {
    "blue"
  } else {
    "green"
  } })
}

getRadius<-function(data_reg) {
  sapply(data_reg$millions, function(millions) {
  if(millions > 9000) {
    12
  } else if(millions > 1200 & millions<=9000) {
    5
  } else {
    2
  } })
}
map2<-leaflet(data_reg) %>%
  addTiles() %>%
  setView(lat=53, lng=9, zoom=3)%>%
  addCircleMarkers(~lng,~lat, color=getColor(data_reg),radius = getRadius(data_reg), label = data_reg$millions)
  
map2

6. EDA (EXPLORATORY DATA ANALYSIS)

With the data filtered and tidy, it is time to do some exploratory data analysis to have a better understanding of the different attributes.

6.2. TOTAL STREAMS PER SEASON

As we are more interested in the season of the year than in each month, we add a column to the dataset containing the season.

Then, we can show the total number of streams per season.

seasons<-data%>%
  mutate(Season=case_when(
    date<as.Date("2019-03-19") | date>=as.Date("2019-12-22") ~ "Winter",
    date<as.Date("2019-06-20") & date>=as.Date("2019-03-19") ~ "Spring",
    date<as.Date("2019-09-22") & date>=as.Date("2019-06-20") ~ "Summer",
    date<as.Date("2019-12-22") & date>=as.Date("2019-09-22") ~ "Fall"
))

seasons_streams<-seasons%>%
  group_by(Season)%>%
  summarise(streams=sum(streams)/1000000)

ggplot(seasons_streams, aes(x=Season, y=streams, fill=Season, label=streams)) +
       geom_bar(width = 1, stat = "identity")+ggtitle("Streams per Season")

As the figure shows, more songs are streamed during summer.

6.3. TOP 10 SONGS AND ARTISTS

From the dataset we can determine the 10 songs that spent the longest in the top 10 during 2019, and the top 10 artists based on the total number of streams of all their songs contained in the dataset.

popular<-data%>%
  filter(position<=10)%>%
  group_by(track_name)%>%
  summarise(totStreams=n())%>% # number of days the track appeared in the top 10
  arrange(desc(totStreams))%>%
  slice(1:10)

pie(popular$totStreams, popular$track_name, border="white", col=brewer.pal(10,"Set3"), main="The 10 Songs Longest in the Top 10")

m_s_a_top10 <- data%>% 
  select(artist,streams)%>%
  group_by(artist)%>%
  summarise(Streams=sum(streams)/1000000)%>%
  arrange(desc(Streams))%>%
  slice(1:10)

m_s_a_top10$artist <- factor(m_s_a_top10$artist, levels=m_s_a_top10$artist[order(desc(m_s_a_top10$Streams))])

ggplot(m_s_a_top10, aes(x=artist, y=Streams))+ geom_bar(stat="identity", fill="navy")+theme(axis.text.x = element_text(face = "bold", angle = 45, hjust=1))+ ggtitle("Most Listened Artists")+xlab("Artist")+ ylab("Total number of streams")

From the two previous figures, we can conclude that the songs that have been longer in the top 10 do not necessarily correspond to the top 10 artists.

6.4. SONGS FEATURES

Now we are going to focus on the different features.

First of all we are going to compare the features of the top 10 songs.

m_s_t_top10 <-  left_join(popular,data,by = "track_name")

m_s_t_top10_2 <- m_s_t_top10 %>% select(track_name,totStreams,danceability,energy,speechiness,acousticness,liveness,valence) %>% arrange(desc(totStreams))


ggplot(m_s_t_top10_2, aes(track_name)) +
  geom_line(aes(y=danceability, color="danceability",group=1)) +
  geom_line(aes(y=energy, color="energy",group=2)) +
  geom_line(aes(y=speechiness, color="speechiness",group=3)) +
  geom_line(aes(y=acousticness, color="acousticness",group=4)) +
  geom_line(aes(y=liveness, color="liveness",group=5)) +
  geom_line(aes(y=valence, color="valence",group=6)) +
  xlab("Tracks") + ylab("Features") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  ggtitle("Feature Comparison of Top 10 Songs")

From the previous graph we can conclude that most of these songs have similar values for the different features, although valence is the feature that varies the most.

As we said, we want to focus on the different seasons, so next we will show the mean value of each feature per season.

Danceability

ft<-seasons%>%
  select(-position,-temp,-snow,-cloud,-humidity)%>%
  group_by(Season,month=format.Date(date,"%m"))%>%
  summarise(streams=sum(streams)/1000000, danceability=mean(danceability), energy=mean(energy),key=mean(key), liveness=mean(liveness), loudness=mean(loudness), speechiness=mean(speechiness), acousticness=mean(acousticness), tempo=mean(tempo), valence=mean(valence))


ggplot(ft, aes(y=danceability, x=month, color=Season))+ geom_point()+ggtitle("Danceability per month")+xlab("Month") + ylab("Danceability")

Energy

ggplot(ft, aes(y=energy, x=month, color=Season))+ geom_point()+ggtitle("Energy per month")+ xlab("Month") + ylab("Energy")

Key

ggplot(ft, aes(y=key, x=month, color=Season))+ geom_point()+ggtitle("Key per month")+xlab("Month") + ylab("Key")

Liveness

ggplot(ft, aes(y=liveness, x=month, color=Season))+ geom_point()+ggtitle("Liveness per month")+xlab("Month") + ylab("Liveness")

Speechiness

ggplot(ft, aes(y=speechiness, x=month, color=Season))+ geom_point()+ggtitle("Speechiness per month")+xlab("Month") + ylab("Speechiness")

Acousticness

ggplot(ft, aes(y=acousticness, x=month, color=Season))+ geom_point()+ggtitle("Acousticness per month")+xlab("Month") + ylab("Acousticness")

Loudness

ggplot(ft, aes(y=loudness, x=month, color=Season))+ geom_point()+ggtitle("Loudness per month")+xlab("Month") + ylab("Loudness")

Tempo

ggplot(ft, aes(y=tempo, x=month, color=Season))+ geom_point()+ggtitle("Tempo per month")+xlab("Month") + ylab("Tempo")

Valence

ggplot(ft, aes(y=valence, x=month, color=Season))+ geom_point() +ggtitle("Valence per month")+xlab("Month") + ylab("Valence")

By looking at the previous graphs, we can tell that depending on the season, songs with certain features are more common. For example, at first sight, during the summer season users tend to choose more danceable and positive (higher valence) music, while the opposite occurs during winter.

More specifically, based on what we observed in the previous figures, we are going to focus on “danceability”, “energy” and “valence”.

ggplot(seasons,mapping=aes(group=Season, y=danceability, x=Season, color=Season))+ geom_boxplot() + stat_summary(fun.y=mean, geom="point", color="red", fill="red", size=0.5) + ggtitle("Danceability per Season") + ylab("Danceability")

ggplot(seasons,mapping=aes(group=Season, y=energy, x=Season, color=Season))+ geom_boxplot() + stat_summary(fun.y=mean, geom="point", color="red", fill="red", size=0.5) + ggtitle("Energy per Season") + ylab("Energy")

ggplot(seasons,mapping=aes(group=Season, y=valence, x=Season, color=Season))+ geom_boxplot() + stat_summary(fun.y=mean, geom="point", color="red", fill="red", size=0.5) + ggtitle("Valence per Season") + ylab("Valence")

These three boxplots support the previous conclusion, as the mean of the proposed features changes depending on the season: it is higher during summer and lower during winter.

ss<-seasons%>%
  select(-position,-temp,-snow,-cloud,-humidity)%>%
  group_by(Season)%>%
  summarise(streams=sum(streams)/1000000, danceability=mean(danceability, na.rm = TRUE), energy=mean(energy, na.rm = TRUE), key=mean(key, na.rm = TRUE), liveness=mean(liveness, na.rm = TRUE), loudness=mean(loudness, na.rm = TRUE), speechiness=mean(speechiness, na.rm = TRUE), acousticness=mean(acousticness, na.rm = TRUE), tempo=mean(tempo, na.rm = TRUE), valence=mean(valence, na.rm = TRUE))

ggplot(ss,mapping=aes(x=Season, y=danceability, group=1)) + geom_line(color="red") + geom_point(size=0.1) + ylab("Danceability") + ggtitle("Mean Danceability per Season")

ggplot(ss,mapping=aes(x=Season, y=energy, group=1)) + geom_line(color="red") + geom_point(size=0.1) + ylab("Energy") + ggtitle("Mean Energy per Season")

ggplot(ss,mapping=aes(x=Season, y=valence, group=1)) + geom_line(color="red") + geom_point(size=0.1) + ylab("Valence") + ggtitle("Mean Valence per Season")

These plots show the central tendency of the data for the three features we are interested in, and how the value of each feature changes with the season. We will focus on the first two, since their values appear to increase from winter to summer.

7. EXPERIMENT DESIGN AND HYPOTHESIS TESTING

Now that we have looked at the different attributes and completed our analysis and visualization, we will perform linear regression and hypothesis testing.

From the previous analysis, we believe there is a relationship between certain features and the season of the year. To test this, we will fit regression models and check whether the null hypothesis that season has no effect on these features can be rejected.

ggplot(ss,mapping=aes(y=danceability, x=energy)) + geom_point(aes(color=Season)) + geom_smooth(method=lm) + ylab("Danceability") + xlab("Energy") + ggtitle("Relationship between Danceability, Energy and Season")

The scatter plot shows that energy and danceability are linearly related. It also shows that danceable songs are not very common in winter, but become more common as summer arrives.

7.1. REGRESSION MODELS

First, we will create two different linear regression models. One between danceability and seasons and the other between energy and seasons.

Danceability

mod<-seasons%>%
  select(-position,-temp,-snow,-cloud,-humidity)%>%
  group_by(Season,month=format.Date(date,"%m"))%>%
  summarise(danceability=mean(danceability, na.rm = TRUE), energy=mean(energy, na.rm = TRUE))

lm_d <-lm(danceability ~ Season, data=mod)
lm_d %>%
  tidy() %>%
  select(term, estimate, std.error)
## # A tibble: 4 x 3
##   term         estimate std.error
##   <chr>           <dbl>     <dbl>
## 1 (Intercept)  0.685      0.00565
## 2 SeasonSpring 0.0103     0.00799
## 3 SeasonSummer 0.0168     0.00799
## 4 SeasonWinter 0.000972   0.00799

According to the model, taking fall as the baseline (the intercept), danceability is essentially unchanged in winter and rises through spring to its highest value in summer, though not at a constant rate.

Energy

lm_e <-lm(energy ~ Season, data=mod)
lm_e %>%
  tidy() %>%
  select(term, estimate, std.error)
## # A tibble: 4 x 3
##   term         estimate std.error
##   <chr>           <dbl>     <dbl>
## 1 (Intercept)   0.628     0.00759
## 2 SeasonSpring  0.0165    0.0107 
## 3 SeasonSummer  0.0246    0.0107 
## 4 SeasonWinter  0.00403   0.0107

According to this second model, energy is also lowest around fall and winter and highest in summer, once again not increasing at a constant rate.

7.2. GLOBAL STATISTICS OF THE MODELS

We can also get some global statistics of our models.

Danceability

lm_d %>%
  glance() %>%
  select(r.squared, sigma, statistic, df, p.value)
## # A tibble: 1 x 5
##   r.squared  sigma statistic    df p.value
##       <dbl>  <dbl>     <dbl> <int>   <dbl>
## 1     0.335 0.0113      2.01     4   0.166

Energy

lm_e %>%
  glance() %>%
  select(r.squared, sigma, statistic, df, p.value)
## # A tibble: 1 x 5
##   r.squared  sigma statistic    df p.value
##       <dbl>  <dbl>     <dbl> <int>   <dbl>
## 1     0.358 0.0152      2.23     4   0.138
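As a sketch of how these global statistics translate into a decision (simulated toy data, assuming the conventional 5% significance level), we can extract the model p-value with broom::glance and compare it to the threshold:

```r
library(broom)

# Simulated example: a numeric response with a genuine group effect.
set.seed(1)
toy <- data.frame(group = rep(c("A", "B"), each = 20),
                  y = c(rnorm(20, mean = 0), rnorm(20, mean = 2)))

fit <- lm(y ~ group, data = toy)
p <- glance(fit)$p.value

# Reject the null hypothesis of "no group effect" when p < 0.05.
reject <- p < 0.05
reject
```

By this rule, the p-values reported above (0.166 and 0.138) are both greater than 0.05, so the seasonal effect would not be declared statistically significant.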

7.3. RESIDUALS VS FITTED MODEL

To finish, we are going to show the residuals distribution of the models.

Danceability

aug <- lm_d %>%
  augment()

ggplot(aug,aes(x=.fitted,y=.resid)) +
    geom_point() + 
    geom_smooth() +
    labs(x="fitted", y="residual", title="Residuals vs Fitted. Danceability Model")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Energy

aug <- lm_e %>%
  augment()

ggplot(aug,aes(x=.fitted,y=.resid)) +
    geom_point() + 
    geom_smooth() +
    labs(x="fitted", y="residual", title="Residuals vs Fitted. Energy Model")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

As we can see above, the residuals in both models cluster around 0, which suggests that a linear model is a reasonable fit. However, given the global p-values obtained earlier (0.166 for danceability and 0.138 for energy), the seasonal effect is not statistically significant at the usual 5% level, so the relationship between danceability, energy and season should be treated as suggestive rather than conclusive.

8. CONCLUSIONS

As we anticipated at the beginning of this tutorial, music can help us connect with and understand our emotions, but it can also trigger new feelings, such as euphoria or anticipation of a specific event. The analysis carried out suggests that people tend to choose more energetic and danceable music as summer arrives, perhaps because for most people summer is synonymous with holidays and invites a good mood.